 Hanover County


O1 Embedder: Let Retrievers Think Before Action

Yan, Ruiran, Liu, Zheng, Lian, Defu

arXiv.org Artificial Intelligence

The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning over complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before retrieving the target documents. To realize this objective, we address two technical difficulties. First, we design a data synthesis workflow that creates training signals for O1 Embedder by generating initial thoughts from an LLM expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and to perform dense retrieval through contrastive learning. Our approach is evaluated in comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.


R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation

Ye, Fuda, Li, Shuangyin, Zhang, Yongqi, Chen, Lei

arXiv.org Artificial Intelligence

Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework that fills this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs an R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.


Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision

Jiang, Fan, Drummond, Tom, Cohn, Trevor

arXiv.org Artificial Intelligence

Cross-lingual open domain question answering (CLQA) is a complex problem, comprising cross-lingual retrieval from a multilingual knowledge base, followed by answer generation in the query language. Both steps are usually tackled by separate models, requiring substantial annotated datasets, and typically auxiliary resources, like machine translation systems to bridge between languages. In this paper, we show that CLQA can be addressed using a single encoder-decoder model. To effectively train this model, we propose a self-supervised method based on exploiting the cross-lingual link structure within Wikipedia. We demonstrate how linked Wikipedia pages can be used to synthesise supervisory signals for cross-lingual retrieval, through a form of cloze query, and generate more natural questions to supervise answer generation. Together, we show our approach, CLASS, outperforms comparable methods on both supervised and zero-shot language adaptation settings, including those using machine translation.


LinkLogic: A New Method and Benchmark for Explainable Knowledge Graph Predictions

Kumar-Singh, Niraj, Polleti, Gustavo, Paliwal, Saee, Hodos-Nkhereanye, Rachel

arXiv.org Artificial Intelligence

While there are a plethora of methods for link prediction in knowledge graphs, state-of-the-art approaches are often black box, obfuscating model reasoning and thereby limiting the ability of users to make informed decisions about model predictions. Recently, methods have emerged to generate prediction explanations for Knowledge Graph Embedding models, a widely-used class of methods for link prediction. The question then becomes, how well do these explanation systems work? To date this has generally been addressed anecdotally, or through time-consuming user research. In this work, we present an in-depth exploration of a simple link prediction explanation method we call LinkLogic, that surfaces and ranks explanatory information used for the prediction. Importantly, we construct the first-ever link prediction explanation benchmark, based on family structures present in the FB13 dataset. We demonstrate the use of this benchmark as a rich evaluation sandbox, probing LinkLogic quantitatively and qualitatively to assess the fidelity, selectivity and relevance of the generated explanations. We hope our work paves the way for more holistic and empirical assessment of knowledge graph prediction explanation methods in the future.


Automatic Roof Type Classification Through Machine Learning for Regional Wind Risk Assessment

Meng, Shuochuan, Soleimani-Babakamali, Mohammad Hesam, Taciroglu, Ertugrul

arXiv.org Artificial Intelligence

Roof type is one of the most critical building characteristics for wind vulnerability modeling. It is also the most frequently missing building feature from publicly available databases. An automatic roof classification framework is developed herein to generate high-resolution roof-type data using machine learning. A Convolutional Neural Network (CNN) was trained to classify roof types using building-level satellite images. The model achieved an F1 score of 0.96 on predicting roof types for 1,000 test buildings. The CNN model was then used to predict roof types for 161,772 single-family houses in New Hanover County, NC, and Miami-Dade County, FL. The distribution of roof types at the city and census-tract scales is presented. A high variance was observed in the dominant roof type among census tracts. To improve the completeness of the roof-type data, imputation algorithms were developed to populate roof data missing due to low-quality images, using critical building attributes and neighborhood-level roof characteristics.


Topic-Partitioned Multinetwork Embeddings

Krafft, Peter, Moore, Juston, Desmarais, Bruce, Wallach, Hanna M.

Neural Information Processing Systems

We introduce a joint model of network content and context designed for exploratory analysis of email networks via visualization of topic-specific communication patterns. Our model is an admixture model for text and network attributes which uses multinomial distributions over words as mixture components for explaining text and latent Euclidean positions of actors as mixture components for explaining network attributes. We demonstrate the capability of our model for descriptive, explanatory, and exploratory analysis by investigating the inferred topic-specific communication patterns of a new government email dataset, the New Hanover County email corpus. Papers published at the Neural Information Processing Systems Conference.


Topic-Partitioned Multinetwork Embeddings

Krafft, Peter, Moore, Juston, Desmarais, Bruce, Wallach, Hanna M.

Neural Information Processing Systems

We introduce a new Bayesian admixture model intended for exploratory analysis of communication networks--specifically, the discovery and visualization of topic-specific subnetworks in email data sets. Our model produces principled visualizations of email networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model's ability to discover and visualize topic-specific communication patterns using a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models.